Abstract

Trace-driven cache simulation is central to computer design. A trace is a very long sequence,
$x_1, \ldots, x_N$, of references to lines (contiguous locations) from main memory. At the $t$-th
instant, reference $x_t$ is hashed into a set of cache locations, the contents of which are then
compared with $x_t$. If at the $t$-th instant $x_t$ is not present in the cache, then it is said to
be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory
line, and making $x_t$ present for the $(t+1)$-st instant. The problem of parallel simulation of a
subtrace of $N$ references directed to a $C$ line cache set is considered, with the aim of
determining which references are misses and related statistics.

A simulation method is presented for the Least-Recently-Used (LRU) policy, which regardless of
the set size $C$ runs in time $O(\log N)$ using $N$ processors on the exclusive read, exclusive
write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in
$O(C \log N)$ time using $N/\log N$ processors. We present timings of the second algorithm’s
implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of
reference-based line replacement policies is considered, which includes LRU as well as the
Least-Frequently-Used and Random replacement policies. A simulation method is presented for any
such policy that on any trace of length $N$ directed to a $C$ line set runs in $O(C \log N)$ time
with high probability using $N$ processors on the EREW model. The algorithms are simple, have very
little space overhead, and are well-suited for SIMD implementation.
and was initiated during a visit to AT&T Bell Laboratories.
1 Introduction


A cache is a high-speed memory on the access path to a larger, slower main memory. Cache performance is 
critical to the overall performance of computer systems [10], and consequently a tremendous amount of effort 
is put into the evaluation of cache designs. This is particularly true for RISC microprocessor designs, where 
the ratio of the time needed to access an off-chip cache to that needed to access the main memory can be as 
high as 10 [10], and the off-chip cache is typically at least 10 times smaller than the main memory. Trace- 
driven simulations, which evaluate cache performance on actual reference streams taken from characteristic 
programs, are the most reliable and widely used tools for cache design evaluation. These simulations require a 
great deal of computation, because of the many different design possibilities that are simulated, and because 
of the length of the reference traces that drive the simulation [19]. 

Data is moved between main memory and the cache in contiguous blocks called lines. Every memory line 
is hashed to some fixed cache set, but may be placed in any one of the C physical cache lines in the set. In 
emerging computer designs, a microprocessor might be supported by a 1 Megabyte off-chip cache, with a line 
size of 128 bytes, and a set size C = 4. A miss occurs whenever a memory line is referenced, but is not found 
in its set. The cache hardware then fetches the desired line from main memory, overwriting another line in 
the same set if the set is full. The rule used to select which line to replace is called the replacement policy. 
An effective, widely used policy is Least-Recently-Used (LRU), which simply replaces the line accessed least 
recently. The objective of a trace-driven simulation is to determine which references in the trace are misses. 
Given the identities of the misses, statistics of chief interest in cache design are easily computed, such as the 
fraction of read misses, the fraction of write misses, and the number of write-backs (stores of modified lines) 
from cache to main memory. 

Heidelberger and Stone [9] showed that it is valuable to simulate a long trace directed to a few sets, when 
cache miss statistics between sets are highly correlated.$^1$ High correlation removes the need to simulate all
sets, but also removes the easy parallelism that might be exploited by simulating a large number of sets in 
parallel on different processing elements (PEs). A massively parallel method to handle the simulation of a 
long trace targeted to a single set allows more powerful, flexible solutions. 

We consider the problem of determining the misses in a given reference trace, $x_1, \ldots, x_N$,
directed to a set of size $C$. An algorithm is presented (Section 3) that solves this problem in
$O(\log N)$ time using $N/\log N$ PEs, on the exclusive-read, exclusive-write (EREW) model of a
parallel machine. The algorithm and its complexity do not depend on $C$. The algorithm computes the
stack distance $\Delta_t$ associated with each reference $x_t$ [16]. If $x_t$ is not a first
reference to a line then $\Delta_t$ is the smallest set size for which $x_t$ would be a hit;
otherwise $\Delta_t = \infty$.

In Section 4, we present an alternative LRU simulation, with running time $O(C \log N)$ using
$N/\log N$ PEs on the EREW model. The algorithm computes the stack distance at level $C$,
$\Delta_t(C)$, for each reference $x_t$. If $x_t$ is not a first reference to a line then
$\Delta_t(C)$ is the smallest set size $\le C$ for which $x_t$ would be a hit; otherwise
$\Delta_t(C) = \infty$. The algorithm is simple and the implicit constant in the time bound is
favorable. We report timings of this algorithm’s performance on a MasPar [4] SIMD computer having
16384 PEs.

$^1$ Recent experiments (private communication from Harold Stone) have validated that high
correlation exists between sets, but have also shown that special care must be taken when selecting
the sets which are analyzed, as the measured miss ratio from an arbitrary set simulation may not be
an accurate predictor of the overall miss ratio.

In Section 5, a broad class of reference-based replacement policies is considered. Roughly, the class 
contains all stack replacement policies where priorities controlling line replacement are static and can be 
computed efficiently in parallel. This class includes LRU as well as: 

• OPT: Replace the line referenced most remotely in the future. This unrealizable policy provably 
minimizes the number of misses. Its simulation gives a baseline against which realizable policies can 
be measured. 

• Least-Frequently-Used or LFU: Replace the line accessed least often in the past. Ties can be broken 
by, for example, giving higher priority to the reference that has been in the cache the shortest length 
of time. 

• Random: Replace one of the C lines, chosen independently and uniformly at random. Random re- 
placement is easy to implement; furthermore, there is evidence that if the total number of lines in the 
cache (not just the lines in one set) is sufficiently large, the policy works nearly as well as any other 
implementable policy [10].

In Section 5, an algorithm is presented for reference-based policy simulation. Given any trace of
$N$ references targeted to a $C$ line set, the algorithm runs in $O(C \log N)$ time with high
probability using $N$ PEs on the EREW model. (The algorithm is probabilistic; the choice of trace
is not.) In Section 5.3, we extend the class of reference-based replacement policies to include an
aging mechanism, whereby stale lines lose priority and tend to be flushed from the cache.
Accommodating this mechanism increases the algorithm’s running time to $O(C \log^2 N)$.

Our algorithms are simple, require at most $O(\log N)$ space per PE, and break the computation down
into calls to a few primitive parallel subroutines. As a result the algorithms are well-suited for
SIMD architectures, such as the Connection Machine [11] or MasPar [4]. The $O(C \log N)$
with-high-probability bound holds because we have assumed that a fast probabilistic parallel
algorithm [18] is used to solve a certain trapezoidal decomposition problem (Sections 2, 5).
Adopting the notation of [18], this algorithm runs in $\tilde{O}(\log N)$ time using $N$ PEs,
meaning that there is a constant $k$ such that the time exceeds $k m \log N$ with probability less
than $N^{-m}$ for any $m > 1$. In practice, simpler, deterministic methods may do better, while
raising the asymptotic time bound to $O(C \log^2 N)$.

For simplicity, we have assumed the problem size $N$ is comparable to the number of PEs, so that it
is as if each PE handles a few references (up to $\log N$). However, a “supersaturated” setup [8]
may be effective in practice, where a large block of consecutive references would be loaded in the
local memory of each PE. Our algorithms generalize to that setup, by using efficient supersaturated
implementations of the underlying parallel primitives (cf. [12, 17]). Indeed, our implementation of
the LRU algorithm is a supersaturated one, with complexity $O(C(N/P + \log P))$ for a reference
trace with $N$ elements on an architecture with $P$ processors.

Collecting the cache miss statistics mentioned above adds just $O(\log N)$ time. Moreover, by the nature
of the replacement policies and the simulation methods, statistics for each set size up to C can be computed 
at this cost. All of our algorithms can be adapted for efficient simultaneous simulation of many sets, by the 
simple device of initially sorting the references on the basis of their set identifiers. 

Heidelberger and Stone [9] had the original insight that trace-driven simulation of an LRU cache
set could be parallelized. Their algorithm is intended for a network of $P$ MIMD processors, and
requires $P \ll N$ for good speedup. Our work was motivated by theirs; our algorithms are
different, apply to a larger class of replacement policies, and to a different class of
architectures. Lin, Baer, and Lazowska have considered parallelizing cache simulations, in the
context of multiprocessor cache protocols [15]. Their method assumes that each individual
processor’s cache is simulated on a different PE, so that the degree of parallelism is limited to
the number of caches in the simulated system. An important and beautiful paper on cache simulation
was published in 1970 by Mattson, Gecsei, Slutz, and Traiger [16]. Most of our notation is taken
from that paper.

The practical utility of implementing trace-driven cache simulations on today’s SIMD computers has yet 
to be shown, although our implementation proves the great promise of the approach. It seems likely that a 
very long reference trace will have to be partitioned into blocks, where one block is processed at a time. The 
I/O problem is to move the blocks to the processors fast enough to keep them busy. An attractive alternative 
is to use a synthetic trace; for example, Thiebaut, Stone, and Wolf [20] recently proposed a simple method
for random generation of realistic traces. 

2 Preliminaries 

2.1 Cache Notation 

Henceforth, we focus on a single set cache, and treat its size $C$ as a parameter. Let $B_t(C)$
denote the set of lines stored just after reference $x_t$. Each reference must be cached, so
$x_t \in B_t(C)$ for all $C \ge 1$ and $t \ge 1$. (By convention, $B_t(0)$ is the empty set.) If
the cache is full ($|B_{t-1}(C)| = C$) and $x_t$ is a miss ($x_t \notin B_{t-1}(C)$) then $x_t$
replaces a line in $B_{t-1}(C)$. We refer to this replaced line as $y_t$. All of the replacement
policies we consider are stack policies [16], meaning if a reference is a hit given that the cache
size is $C$ then the reference will remain a hit if the cache size is increased to $C + 1$. That is,

\[ B_t(C) \subseteq B_t(C + 1) \quad \text{for all } C \ge 0. \]

This inclusion allows us to order the lines of the cache by the least size needed for their appearance.
Figure 1: An example of the LRU rule acting on an $N = 18$ line trace, with lines labeled $a$
through $d$; $\emptyset$ is the empty line marker.

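Define the $i$-th element of $B_t(C)$ as

\[ s_t(i) = \begin{cases} B_t(i) - B_t(i-1) & \text{if } |B_t(i)| = i \\ \emptyset & \text{otherwise.} \end{cases} \]

The symbol $\emptyset$ is an empty line marker; $s_t(i) = \emptyset$ if fewer than $i$ lines belong
to $B_t(i)$. We assume the stack starts empty, and therefore let $s_0(i) = \emptyset$ for all $i > 0$.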

Figure 1 gives an example, for the LRU replacement policy. The trace length is $N = 18$, and the
lines are labeled $a$ through $d$. The first level, $s_t(1)$, coincides with the trace itself.
Consider the first two levels; i.e., the cache contents given $C = 2$. The cache is initially
empty. The first two references, $x_1 = a$, $x_2 = c$, miss, and as a result $B_1(2) = \{a\}$,
$B_2(2) = \{a, c\}$. The third reference, $x_3 = b$, also misses, forcing the replacement of
$y_3 = a$, yielding $B_3(2) = \{b, c\}$. The fourth reference, $x_4 = c$, hits, so
$B_4(2) = B_3(2)$, and so forth.

Hit and miss statistics are easily extracted from stack distances, defined as follows. Let the
level $i$ stack distance $\Delta_t(i)$ denote the smallest cache size $\le i$ such that $x_t$ is a
hit, or $\infty$ if $x_t$ is a miss for cache size $i$. Thus, given $\Delta_t(C)$, for
$t = 1, \ldots, N$, we can extract the hits for any cache size $c \le C$. More generally, define
the stack distance $\Delta_t = \lim_{C \to \infty} \Delta_t(C)$ to be the smallest cache size such
that $x_t$ is a hit, or $\infty$ if there is no prior reference to the same line (that is, no
$s < t$ with $x_s = x_t$). Stack distances are shown in Figure 1.
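As a point of reference for the definitions above, the following is a minimal serial sketch (not the parallel method of this paper) that computes LRU stack distances with an explicit stack and then counts misses for every cache size at once. The function names and the ten-reference trace are hypothetical and chosen only for illustration.

```python
import math


def lru_stack_distances(trace):
    """Return the LRU stack distance of every reference in `trace`.

    The distance is the 1-based depth at which the referenced line sits in
    the LRU stack just before the reference, or math.inf on a first reference.
    """
    stack = []          # stack[0] is level 1 (the most recently used line)
    distances = []
    for line in trace:
        if line in stack:
            depth = stack.index(line) + 1
            stack.remove(line)
        else:
            depth = math.inf
        stack.insert(0, line)   # the referenced line always moves to level 1
        distances.append(depth)
    return distances


def misses_by_cache_size(distances, max_size):
    """A reference misses in a c-line set exactly when its distance exceeds c."""
    return {c: sum(1 for d in distances if d > c) for c in range(1, max_size + 1)}


trace = list("acbccacadb")      # hypothetical ten-reference trace
distances = lru_stack_distances(trace)
print(distances)
print(misses_by_cache_size(distances, 4))
```

One pass over the trace yields the miss counts for all set sizes up to the chosen maximum, which is the practical appeal of stack distances.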

2.2 Parallel Processing Model 

Our algorithms are well-suited for a wide variety of parallel architectures, because the algorithms transform 
data in simple ways using a small number of basic, highly parallelizable operations. However, to state 
precise time and processor requirements, we must choose a precise model of parallel computation. The 
EREW (exclusive read, exclusive write) model (cf. [14]) provides a nice blend of simplicity and realism. 
In this model, the PEs operate in lockstep. There is a global shared memory, supporting at unit cost any 
pattern of accesses except those where two PEs simultaneously access the same location. 

We state the complexity of our algorithms with respect to the EREW model. We now list the parallel 
subroutines used in our algorithms, and their complexities on the EREW model. 

• Merging: Two sorted lists each of length $N$ can be merged in time $O(\log N)$ using $N/\log N$
PEs, via Batcher’s odd-even merge algorithm [2].

• Sorting: A list of length $N$ can be sorted in $O(\log N)$ time using $N$ PEs [5].

• 2d Ranking: Given points $(x_i, y_i)$, $i = 1, \ldots, N$, compute for each point $(x_i, y_i)$
its rank, the number of other points $(x_j, y_j)$ strictly above and to the right: $x_i < x_j$ and
$y_i < y_j$. In slightly different form, this is the problem of computing the empirical cumulative
distribution function (ECDF), considered in [3]. The multidimensional divide-and-conquer serial
algorithm given in [3] parallelizes easily to solve the problem in $O(\log^2 N)$ time using $N$
PEs. Atallah et al. improved on this, lowering the time to $O(\log N)$ using $N$ PEs.

• Closest Larger Right Neighbor (CLRN) Problem: Given input numbers $a_1, \ldots, a_N$, find, for
each $a_i$, the index of the first larger number to the right; i.e., for each
$i = 1, \ldots, N - 1$, compute $b_i = \min\{j > i : a_j > a_i\}$ if there is some $j > i$ with
$a_j > a_i$, and $b_i = N + 1$ otherwise. The CLRN problem can be reduced to trapezoidal
decomposition [18]: given a set of line segments and points, from each point, report the line
segment first hit (if any) by a ray shot horizontally to the right. To make the reduction, consider
the polygonal path connecting consecutive points $(i, a_i)$, $i = 1, \ldots, N$. If
$a_{i+1} > a_i$ then we know $b_i = i + 1$. Otherwise, $b_i$ is the index $j$ of the right end
point $(j, a_j)$ of the segment from $(j - 1, a_{j-1})$ to $(j, a_j)$ first hit by the ray shot
horizontally to the right from $(i, a_i)$, if there is an $a_j > a_i$ with $j > i$. If not, then
$b_i = N + 1$. Reif and Sen give a probabilistic algorithm for trapezoidal decomposition. Applying
that algorithm to the CLRN problem yields its solution in $\tilde{O}(\log N)$ time using $N$ PEs.
Alternatively, the CLRN problem can be solved by a binary search-like algorithm, given in
Section 5, in $O(\log^2 N)$ time using $N$ PEs.

• Parallel Prefix (scan, segmented scan): Given inputs $a_1, \ldots, a_N$ and an associative
operator $\circ$, compute the partial products $p_1, \ldots, p_N$ where
$p_i = a_1 \circ a_2 \circ \cdots \circ a_i$. Solutions to this parallel prefix problem [13] are
commonly called scan computations. The problem can be solved in $O(\log N)$ time using $N/\log N$
PEs [12].

A variation breaks the products over the indices $[1, N]$ into segments over these indices, with
the segment boundaries also given as inputs. For example, an additional vector $b_1, \ldots, b_N$
is given where $b_1 = 0$, for $i > 1$, $b_i$ is either 0 or 1, and the 0’s mark the segments’ left
boundaries. Specifically, if $b_i = 0$ then $p_i = a_i$; otherwise,
$p_i = a_j \circ a_{j+1} \circ \cdots \circ a_i$ where $j$ is the largest index $k$,
$1 \le k < i$, such that $b_k = 0$. The segmented problem has the same complexity as the original.
In the algorithms below, we use copy-scans defined by $\alpha \circ \beta = \alpha$, and add-scans
where $\circ$ is addition.

None of the algorithms listed above requires more than $O(\log N)$ space per PE.
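To pin down what three of these primitives compute, here is a minimal serial sketch of their input/output behavior. It is not a parallel implementation and makes no claim about the cited algorithms; the function names and sample inputs are assumptions made for illustration. The CLRN routine uses a standard stack sweep rather than trapezoidal decomposition.

```python
import operator


def segmented_scan(a, b, op):
    """Segmented prefix: b[i] == 0 marks the left boundary of a segment."""
    p = []
    for a_i, b_i in zip(a, b):
        if b_i == 0 or not p:
            p.append(a_i)               # a new segment starts with its own value
        else:
            p.append(op(p[-1], a_i))    # otherwise extend the running product
    return p


def clrn(a):
    """For each position i (1-based), index of the first larger value to the right,
    or len(a) + 1 if there is none."""
    n = len(a)
    b = [n + 1] * n
    pending = []                        # 1-based indices still waiting for a larger value
    for j, value in enumerate(a, start=1):
        while pending and a[pending[-1] - 1] < value:
            b[pending.pop() - 1] = j
        pending.append(j)
    return b


def rank_2d(points):
    """Brute-force 2d rank: number of points strictly above and to the right."""
    return [sum(1 for (xj, yj) in points if xi < xj and yi < yj)
            for (xi, yi) in points]


a = [3, 1, 4, 1, 5, 9]
flags = [0, 1, 1, 0, 1, 1]
print(segmented_scan(a, flags, operator.add))                    # add-scan
print(segmented_scan(a, flags, lambda alpha, beta: alpha))       # copy-scan
print(clrn([2, 5, 1, 3, 4]))
print(rank_2d([(1, 1), (2, 3), (3, 2)]))
```

The copy-scan simply propagates the first value of each segment across the segment, which is the behavior the later sections rely on.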

3 Fast Parallel LRU Simulation 

In this section we present a fast parallel algorithm for computing stack distances under the LRU replacement 
policy. 
LRU may be characterized as follows. Reference to $a = x_t$ places $a$ at the first level of the
stack. Until $a$ is referenced again, it can only move down in the stack. Specifically, after $a$
has been pushed to level $i$ it remains there until a reference is made either to $a$ (moving $a$
to level 1) or to a line not stored in levels 1 through $i - 1$ (moving $a$ to level $i + 1$). As a
result, the stack distance $\Delta_t$ is one greater than the number of distinct lines in the
subtrace between $t$ and the closest prior reference to $a$ (or $\infty$ if there is no prior
reference to $a$). For example, in Figure 1, consider the consecutive references to line $b$ at
$t = 3$ and $t = 10$. The stack distance $\Delta_{10} = 4$ because 3 distinct symbols belong to the
subtrace $x_4, \ldots, x_9$. More generally, letting

\[ \mathrm{prev}(t) = \begin{cases} \max\{s < t : x_s = x_t\} & \text{if } x_s = x_t \text{ for some } s < t \\ 0 & \text{otherwise} \end{cases} \]

we obtain

\[ \Delta_t = \begin{cases} 1 + \text{number of distinct symbols in } x_{\mathrm{prev}(t)+1}, \ldots, x_{t-1} & \text{if } \mathrm{prev}(t) > 0 \\ \infty & \text{otherwise.} \end{cases} \]

Let us take a geometric view of this new problem of counting distinct symbols within subtraces. As
illustrated in Figure 2, identify each reference $x_t$ with the point $(t, \mathrm{next}(t))$, where

\[ \mathrm{next}(t) = \begin{cases} \min\{s > t : x_s = x_t\} & \text{if } x_s = x_t \text{ for some } s > t \\ N + 1 & \text{otherwise.} \end{cases} \]

Note that the last references to symbols within the subtrace $x_{\mathrm{prev}(t)+1}, \ldots, x_{t-1}$
are identified by those points $(s, \mathrm{next}(s))$ satisfying

\[ \mathrm{prev}(t) < s < t < \mathrm{next}(s). \]

These are the points that lie strictly within the rectangle with lower left hand corner
$(\mathrm{prev}(t), t)$, lower right hand corner $(t, t)$ and sides extending upwards to
$(\mathrm{prev}(t), N + 1)$ and $(t, N + 1)$. Again, see Figure 2. Counting these points reduces to
2d-ranking. Specifically, suppose we know the 2d-rank, $\mathrm{rank}(u, v)$, of each point
$(u, v)$ in the union of sets $\{(t, \mathrm{next}(t)) : 1 \le t \le N\}$ and
$\{(t, t) : 1 \le t \le N\}$. Then, the stack distance

\[ \Delta_t = \begin{cases} \mathrm{rank}(\mathrm{prev}(t), t) - \mathrm{rank}(t, t) & \text{if } \mathrm{prev}(t) > 0 \\ \infty & \text{otherwise.} \end{cases} \]

We see from Figure 2 that $\Delta_{10} = 4$ because $\mathrm{rank}(3, 10) = 20$ and
$\mathrm{rank}(10, 10) = 16$.
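The step from the rectangle count to the rank difference can be spelled out as follows. This short derivation uses only the definitions of rank, prev and next given in this section; the symbol $S$ for the union of the two point sets is introduced here for convenience. It assumes $\mathrm{prev}(t) > 0$, so that $(\mathrm{prev}(t), t) = (\mathrm{prev}(t), \mathrm{next}(\mathrm{prev}(t)))$ is itself a point of $S$.

```latex
\begin{align*}
\mathrm{rank}(\mathrm{prev}(t), t) - \mathrm{rank}(t, t)
  &= \#\{(u,v) \in S : u > \mathrm{prev}(t),\ v > t\}
   - \#\{(u,v) \in S : u > t,\ v > t\} \\
  &= \#\{(u,v) \in S : \mathrm{prev}(t) < u \le t,\ v > t\} \\
  &= \underbrace{1}_{\text{the point } (t,\ \mathrm{next}(t))}
   + \#\{\, s : \mathrm{prev}(t) < s < t < \mathrm{next}(s) \,\}
   + \underbrace{0}_{\text{no } (s,s) \text{ satisfies } s \le t < s} \\
  &= 1 + \text{number of distinct symbols in } x_{\mathrm{prev}(t)+1}, \ldots, x_{t-1}
   \;=\; \Delta_t .
\end{align*}
```

The diagonal points $(t, t)$ contribute nothing to the difference; they serve only to make $\mathrm{rank}(t, t)$ well defined.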

Figure 2: The trace of Figure 1 is repeated, along with corresponding next and prev values. The
values $(t, \mathrm{next}(t))$ are plotted as +’s, and the values $(t, t)$ as o’s. The number of
points strictly within the rectangle indicated by dashed lines is one less than the stack distance
of $x_{10} = b$.

Now, let us present the detailed simulation method. Suppose that the trace is initially stored in
the $N$-vector $x$. We use the additional $N$-vectors $p$, $\mathbf{next}$, $\mathbf{prev}$, and
$\boldsymbol{\Delta}$. Initially, let $p_t = t$, so $x_t$ identifies the line and $p_t$ the trace
index of reference $x_t$. Vectors $\mathbf{next}$, $\mathbf{prev}$, and $\boldsymbol{\Delta}$ will
hold permuted copies of the values $\mathrm{next}(t)$, $\mathrm{prev}(t)$, and $\Delta_t$,
respectively. The algorithm is as follows.

1. [Compute next and prev.] Sort the tuples $(x_t, p_t)$ using $x_t$ as the primary key and $p_t$
as the secondary key: $(x_i, p_i) < (x_j, p_j)$ if either $x_i < x_j$, or $x_i = x_j$ and
$p_i < p_j$. Thus, the data now in location $t$ of $x$ and $p$ was in location $p_t$ before the
sort. For all $t = 1, \ldots, N$, set

\[ \mathbf{prev}_t = \begin{cases} p_{t-1} & \text{if } t > 1 \text{ and } x_{t-1} = x_t \\ 0 & \text{otherwise} \end{cases}
\qquad
\mathbf{next}_t = \begin{cases} p_{t+1} & \text{if } t < N \text{ and } x_{t+1} = x_t \\ N + 1 & \text{otherwise.} \end{cases} \]

At this point, the $\mathbf{prev}$ and $\mathbf{next}$ vectors hold permuted copies of the prev and
next values discussed above.

2. [2d-rank.] Compute the 2d-ranks of the set of points

\[ \{(p_t, \mathbf{next}_t) : 1 \le t \le N\} \cup \{(t, t) : 1 \le t \le N\} \]

and set

\[ \boldsymbol{\Delta}_t = \mathrm{rank}(\mathbf{prev}_t, p_t) - \mathrm{rank}(p_t, p_t) \]

(with $\boldsymbol{\Delta}_t = \infty$ when $\mathbf{prev}_t = 0$). As a result,
$\boldsymbol{\Delta}_t = \Delta_{p_t}$, which completes the computation.
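The two steps can be sketched serially as follows. This is an illustrative sketch under stated assumptions, not the parallel implementation: the sort is Python's built-in `sorted`, the 2d-rank is computed by brute force in quadratic time, and the function name and sample trace are hypothetical.

```python
import math


def lru_stack_distances_by_ranking(trace):
    """LRU stack distances via sorting and 2d-ranking (serial, brute-force rank)."""
    n = len(trace)
    # Step 1: sort trace indices by (line, index); p[k] is the original
    # 1-based trace index of the reference now stored at sorted position k.
    p = sorted(range(1, n + 1), key=lambda t: (trace[t - 1], t))
    xs = [trace[t - 1] for t in p]
    prev = [p[k - 1] if k > 0 and xs[k - 1] == xs[k] else 0 for k in range(n)]
    nxt = [p[k + 1] if k < n - 1 and xs[k + 1] == xs[k] else n + 1 for k in range(n)]

    # Step 2: rank queries over the points (p_t, next_t) together with (t, t).
    points = [(p[k], nxt[k]) for k in range(n)] + [(t, t) for t in range(1, n + 1)]

    def rank(u, v):
        # number of points strictly above and to the right of (u, v)
        return sum(1 for (a, b) in points if a > u and b > v)

    delta = [math.inf] * n
    for k in range(n):
        if prev[k] > 0:
            # the value computed at sorted position k belongs to trace index p[k]
            delta[p[k] - 1] = rank(prev[k], p[k]) - rank(p[k], p[k])
    return delta


print(lru_stack_distances_by_ranking(list("acbccacadb")))
```

Writing the result back through `p` undoes the permutation introduced by the sort, which corresponds to the identity $\boldsymbol{\Delta}_t = \Delta_{p_t}$ in step 2.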

Sorting within the first step costs $O(\log N)$ time on $N$ PEs, using the EREW model [5]. The next
and prev computations may be done within the same time and processor bounds using segmented
copy-scans, with changes in the $x$ vector marking the segment boundaries. The 2d-ranking within
the second step costs $O(\log N)$ time on $N$ PEs [1]. Thus:

Theorem 1 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated stack distances
$\Delta_1, \ldots, \Delta_N$ induced under the LRU replacement policy can be computed in
$O(\log N)$ time using $N$ PEs.

Aiming for a simpler implementation and smaller implicit constants, we may sacrifice a $\log N$
factor in the running time. The natural parallelization of Bentley’s multidimensional divide and
conquer method [3] gives a 2d-ranking algorithm that runs in $O(\log^2 N)$ time using $N$ PEs.
Similarly, sorting with, for example, Batcher’s method [2] requires $O(\log^2 N)$ time on $N$ PEs.


4 Parallel Simulation of LRU Level by Level 


An alternative approach is to simulate LRU level by level, at the $i$-th iteration computing the
level $i$ cache contents $s_1(i), \ldots, s_N(i)$ and stack distances
$\Delta_1(i), \ldots, \Delta_N(i)$. Assuming a set size of $C$, the final results are the stack
distances $\Delta_1(C), \ldots, \Delta_N(C)$.

Define reference $x_t$ to be a prior hit (prior miss) at level $i$ if $x_t$ is a hit (miss) given
that the cache size is $i - 1$. That is, $x_t$ is a prior miss at level $i$ if
$x_t \notin B_{t-1}(i - 1)$. If $x_t$ is a prior hit at level $i$ then $\Delta_t(i - 1) < i$;
otherwise $\Delta_t(i - 1) = \infty$. In Figure 3, we have marked the prior hits $x_t$ at level 3
by underscoring the symbol at level 2 in column $t$. In studying this figure one should remember
that an underscore on symbol $s_t(i - 1)$ means that symbol $x_t$ was a hit in an $(i - 1)$-line
cache, not that $s_t(i - 1)$ was. The placement of underscores was chosen to highlight the
propagation of a symbol across a sequence of prior hit positions, to be described below. For
example, of the first ten references four are prior hits at level 3, namely $x_4$, $x_5$, $x_7$,
and $x_8$, because $c$ ($= x_4, x_5, x_7$) is found in $B_3(2)$, $B_4(2)$ and $B_6(2)$, and symbol
$a$ ($= x_8$) is found in $B_7(2)$.

Any prior hit at level $i - 1$ is also a prior hit at level $i$ (for example, $x_5$ in Figure 3).
Under LRU, the other prior hits at level $i$ are the references $x_t$ satisfying
$x_t = s_{t-1}(i - 1)$; i.e., the references that hit at the last level of the size $i - 1$ cache
(for example, $x_4$ in Figure 3).

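The key to the simulation method is that under LRU, for all $t \ge 1$ and $i > 1$,

\[ s_t(i) = \begin{cases} s_{t-1}(i - 1) & \text{if } x_t \text{ is a prior miss at level } i \\ s_{t-1}(i) & \text{otherwise,} \end{cases} \qquad (2) \]

where $s_t(1) = x_t$ and $s_0(i) = \emptyset$.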
Figure 3: LRU acting on an 18 line trace, assuming a cache size $C = 3$. Each prior hit $x_t$ at
level 3 is identified by underscoring $s_t(2)$.

To see this, suppose $x_t \notin B_{t-1}(i - 1)$. The LRU rule puts $x_t$ into level one, and
shifts lines $1, 2, \ldots, i - 1$ down one level, which pushes $s_{t-1}(i - 1)$ to level $i$. On
the other hand, if $x_t$ is a prior hit at level $i$, the cache update leaves level $i$ unchanged.
In Figure 3, we see that the 3rd and the 6th references are prior misses at level 3, and that the
intervening references are prior hits. As a result, $s_2(2) = a$ enters level 3 at $t = 3$ and
propagates over prior hits at level 3 until $t = 6$, where it is replaced with $s_5(2) = b$, which
in turn propagates up through $t = 8$.

We now describe the simulation algorithm, taking special care with the details because similar
methods are needed in Section 5. At the $(i - 1)$-st iteration we will overwrite vectors
$s = (s_0, s_1, \ldots, s_N)$ and $d = (d_1, d_2, \ldots, d_N)$ with the level $i$ cache contents
and stack distances, $(s_0(i), s_1(i), \ldots, s_N(i))$ and
$(\Delta_1(i), \Delta_2(i), \ldots, \Delta_N(i))$, respectively. A vector
$x = (x_1, x_2, \ldots, x_N)$ holds the trace, and another vector $u = (u_1, \ldots, u_N)$ will
hold a copy of $(s_0(i - 1), s_1(i - 1), \ldots, s_{N-1}(i - 1))$. To initialize the computation,
for $t = 1, \ldots, N$, set $d_t = \infty$ and $s_t = x_t$, and set $s_0 = \emptyset$. For
$i = 1, \ldots, C$, do as follows.

1. [Update the Level $i$ Cache Contents via equation (2).] For $t = 1, \ldots, N$, set
$u_t = s_{t-1}$. For $t = 1, \ldots, N$, if $d_t \ne \infty$ then set $s_t = s_{t-1}$; otherwise,
set $s_t = u_t$. This is to be understood, but not implemented, as a serial update: first $s_1$ is
updated, then $s_2$, and so forth.

2. [Update the Stack Distances.] For $t = 1, \ldots, N$, set $d_t = i$ if $d_t = \infty$ and
$u_t = x_t$; otherwise leave $d_t$ unchanged.

The right shift of $s$ into $u$ in step 1 and the update to the stack distances in step 2 are
naturally parallel operations. The update of $s$ is a segmented copy-scan, with the coordinates $t$
with $d_t = \infty$ marking the segment boundaries. Hence, the cost of both steps is just
$O(\log N)$ time using $N/\log N$ PEs. As there are a total of $C$ iterations to perform, we obtain:

Theorem 2 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated level $C$ stack
distances $\Delta_1(C), \ldots, \Delta_N(C)$ under the LRU replacement policy can be computed in
$O(C \log N)$ time using $N/\log N$ processors.
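A serial rendering of the level-by-level idea may make recurrence (2) concrete. The sketch below is an illustration under stated assumptions rather than the MasPar implementation: names are hypothetical, the scans are written as plain loops, and within each pass the hit test against the shifted level $i$ contents is applied before the contents are advanced to level $i + 1$ (an ordering chosen for this sketch so that `d` marks exactly the prior hits needed by the recurrence).

```python
import math

EMPTY = None  # empty line marker


def lru_level_by_level(trace, C):
    """Level-C LRU stack distances, computed one stack level per pass."""
    n = len(trace)
    x = [EMPTY] + list(trace)           # 1-based: x[t] is reference x_t
    s = x[:]                            # s[t] = s_t(1) = x_t, s[0] = EMPTY
    d = [math.inf] * (n + 1)            # d[t] = stack distance found so far
    for i in range(1, C + 1):
        u = [EMPTY] + s[:-1]            # right shift: u[t] = s_{t-1}(i) for t >= 1
        for t in range(1, n + 1):       # independent per t: detect hits at level i
            if d[t] == math.inf and u[t] is not EMPTY and u[t] == x[t]:
                d[t] = i
        # Advance the contents to level i + 1 via recurrence (2). Written
        # serially here; it is a segmented copy-scan whose segments start at
        # the positions with d[t] == inf (the prior misses at level i + 1).
        advanced = [EMPTY] * (n + 1)
        for t in range(1, n + 1):
            advanced[t] = u[t] if d[t] == math.inf else advanced[t - 1]
        s = advanced
    return d[1:]


print(lru_level_by_level(list("acbccacadb"), 4))
print(lru_level_by_level(list("acbccacadb"), 2))
```

Each pass touches every reference once, so the serial cost is proportional to $C \cdot N$, mirroring the factor $C$ in Theorem 2.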

We implemented this algorithm on a MasPar MP-1 computer [4], with 16384 PEs. Each PE is a 4-bit
processor with a clock cycle of 80 nanoseconds. A typical integer operation such as those common in
our algorithm requires a few tens of clocks.

                                  Trace Length
 C    2^14   2^15   2^16   2^17   2^18   2^19   2^20   2^21   2^22   2^23   2^24   2^25
 4     3.7    3.4    3.9    5.0    7.3   11.4     20   37.1   71.1    139    275    548
 8     8.6    8.4    9.6     12   16.7   25.8   44.3   81.3    155    303    599   1190
16    18.6   18.5   20.9   25.8   35.5   54.5     93    170    324    630   1245   2474
32    38.5   38.7   43.6   53.4   73.1    120    190    347    660   1286   2539   5043

Table 1: Execution time of the LRU algorithm on a MasPar MP-1 with $2^{14}$ PEs, in milliseconds,
as a function of trace length and set size $C$.

Our implementation supports “super-saturation” of the PEs, as described earlier. The PE memory size
permits us to assign as many as 2048 references to each PE, thereby permitting the simultaneous
simulation of a trace with over 33 million references. The performance data we present includes
only the time spent in the solution phase of the algorithm. The traces were generated randomly. For
a given trace length and cache set size we observed a 10-15% increase in running time between
caches with a very low hit ratio and caches with a high hit ratio. This is likely due to the fact
that long segments accompany high hit ratios, requiring greater inter-PE communication to implement
the copy-scan. The timings presented are from traces with nearly perfect hit ratios, and so
represent an upper bound on the timing one might expect from an actual trace.

A full implementation would have to spend time loading the trace; the I/O time required depends on the 
available I/O hardware and the organization of the trace on the I/O devices. In light of our timings, it is clear 
that moving the trace onto the machine may well be the most serious bottleneck an actual implementation 
would face. 

Our experiments vary the length of the trace from $2^{14}$ to $2^{25}$, and the set size $C$ from 2 to 32. The
presented timings are averages, given in milliseconds, taken by executing the solution loop many times in 
succession. 

Observe that about five seconds of execution time were required to analyze the behavior of a 32-line set
on a trace with $2^{25} = 33{,}554{,}432$ references. This is 710 times faster than the solution time (with trace
generation costs subtracted off) of an optimized serial algorithm we implemented on a Sparc-1+ workstation. 
These timings demonstrate the remarkable promise of massive parallelism for trace-driven cache simulation. 
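A little arithmetic puts these figures in perspective. The numbers below take the $C = 32$ entries of Table 1 and the quoted speedup of 710 at face value; the derived quantities are rough estimates, not additional measurements.

```latex
\begin{align*}
\text{throughput at } C = 32:\quad
  & \frac{2^{25}\ \text{references}}{5.043\ \text{s}} \approx 6.65 \times 10^{6}\ \text{references per second}, \\
\text{implied serial solution time:}\quad
  & 710 \times 5.043\ \text{s} \approx 3580\ \text{s} \approx 1\ \text{hour}, \\
\text{doubling } N \text{ from } 2^{24} \text{ to } 2^{25}:\quad
  & \frac{5043\ \text{ms}}{2539\ \text{ms}} \approx 1.99 .
\end{align*}
```

The near-doubling in the last line is what the supersaturated bound $O(C(N/P + \log P))$ suggests once $N/P$ dominates $\log P$.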


5 Reference Based Replacement Policies 

We now broaden the scope of our methods, to handle a large class of line replacement policies, which we 
term reference-based. In Section 5.3, we extend the class to handle policies that allow priorities associated 
with cached lines to “age” so that stale lines tend to be flushed. 
Mattson et al. [16] show that a stack policy is obtained if a numerical priority $P(s_t(i))$ is
assigned to each line $s_t(i)$ at reference $x_t$, and the line $y_t(C)$ chosen for replacement on
loading $x_t$ is the one with least priority among the members of $B_{t-1}(C)$. In the class of
policies we now consider, a line’s priority is established at the point it appears in the reference
stream, after which it remains constant until the line is referenced again. Of course we must be
able to calculate the priorities from the reference trace. Thus we limit attention to policies that
support efficient parallel priority calculations. As practical policies seem to use very simple
priority assignments, this limitation is mild. Here, “efficient” means within the resource bounds
needed for the rest of our simulation method: $\tilde{O}(\log N)$ time using $N$ processors. Recall
(Section 2.2) that $\tilde{O}(\log N)$ means $O(\log N)$ with high probability.

Let us define the class of reference-based replacement policies as those stack policies induced by priorities 
satisfying the following conditions. 

R1: All $P(x_t)$ values can be computed quickly in parallel: in $O(\log N)$ time using $N$ PEs. For
example, the priorities for LFU (Least Frequently Used) can be established with a sort on the
reference tags, followed by a segmented sum-scan.

R2: A line’s replacement priority does not change except when the line appears in the reference stream. 

Several important replacement policies are reference-based, including 

• LRU: $P(x_t) = t$.

• LFU: $P(x_t) = \mathrm{Count}(x_t, t)$, the number of references $x_u = x_t$ for $u < t$. Ties
can be broken, for example, by lexicographic ordering of the lines, or by giving higher priority to
the line that has been in the cache the shortest length of time.
($P(x_t) = \mathrm{Count}(x_t, t) - 1/(t + 1)$ would serve the latter purpose.)

• OPT: $P(x_t)$ is the negation of the smallest index $u > t$ such that $x_u = x_t$.
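The three priority assignments listed above can be computed from the trace alone. The following serial sketch shows one way to do so; the function names are hypothetical, LFU ties are left unbroken, and giving priority $-\infty$ to a line that is never referenced again under OPT is an assumption of the sketch (the text does not specify that case). A parallel version would use the sort and segmented scan mentioned in R1.

```python
import math
from collections import defaultdict


def lru_priorities(trace):
    """P(x_t) = t, with t counted from 1."""
    return list(range(1, len(trace) + 1))


def lfu_priorities(trace):
    """P(x_t) = number of earlier references to the same line (ties not broken)."""
    seen = defaultdict(int)
    out = []
    for line in trace:
        out.append(seen[line])
        seen[line] += 1
    return out


def opt_priorities(trace):
    """P(x_t) = -(index of the next reference to the same line).

    Assumption of this sketch: a line never referenced again gets -inf.
    """
    next_index = {}
    out = [0] * len(trace)
    for t in range(len(trace), 0, -1):      # scan right to left, 1-based t
        line = trace[t - 1]
        out[t - 1] = -next_index.get(line, math.inf)
        next_index[line] = t
    return out


trace = list("acbccacadb")                  # hypothetical trace
print(lru_priorities(trace))
print(lfu_priorities(trace))
print(opt_priorities(trace))
```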

In addition, the Random replacement (RR) policy shares most of the properties we need to quickly simulate 
reference-based policies, and we include it in this class as a special case. Under RR, priorities are chosen 
that determine a uniform random ranking of the cache contents; details are given below. 

Figure 4 gives an example of the operation of LFU with ties broken by lexicographic ordering, a < b < 
c < d. A line’s subscript equals the number of earlier references to the line. First, note that the stack order 
and the priority order may differ. A line with low priority can be buried in the middle of the stack order, 
for example, line d at t = 13. There are important departures from the behavior of the LRU policy. Under 
LRU, the replaced line is the one at the lowest stack level. Here, we see that if the cache size is 2 then at 
t = 9, line d misses and replaces line a at the first stack level, leaving line c in place at the second stack level. 
To illustrate the entry and propagation of lines across a given level, each prior miss $x_t$ at
level 3 is marked by underscoring $s_t(2)$. As in LRU, a line propagates across all prior hits.
Unlike LRU, a line may propagate across some prior misses. For example, line $a$ enters level 3 at
$t = 9$ and propagates until $t = 15$, across the prior misses at $t = 10$ and $t = 12$.
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Figure 4: LFU; ties are broken by a < b < c < d. The subscripts count the number of earlier references, and 
underscores mark prior hits at level 3. Also shown are the stack distances A* (2) at level 2 and the lowest 
priority lines y t ( 2) £ B t _i( 2) at level 2. 


By convention, the priority of $\emptyset$, the empty line marker, is $-\infty$. As before, we assume $s_0(i) = \emptyset$ for
$i = 1, \ldots, C$. It follows from the analysis of [16] that a stack policy is induced by the rule stating that the
line of least priority is selected for replacement. We refer to line $y_t(C)$ as a replacee, and define $y_t(C)$ to be
the least priority line in $B_{t-1}(C)$ if $B_{t-1}(C)$ is full; i.e., $|B_{t-1}(C)| = C$. If $B_{t-1}(C)$ is not full then we let
$y_t(C) = \emptyset$. In studying our notation it is important to remember that $y_t(C)$ refers to a line with a particular
property in the cache after reference $x_{t-1}$, not after $x_t$. This convention follows Mattson et al. [16].

A simple recurrence determines the level $i$ cache contents. Given two cached lines $\alpha$ and $\beta$, let $\max_P\{\alpha, \beta\}$
select the one with higher priority. For all $t > 0$, $s_t(1) = x_t$. For all $i > 1$ and $t > 0$,

$$
s_t(i) =
\begin{cases}
s_{t-1}(i) & \text{if } x_t \text{ is a prior hit at level } i \\
y_t(i-1) & \text{if } x_t \text{ is a prior miss at level } i \text{ and } s_{t-1}(i) = x_t \\
\max_P\{y_t(i-1),\, s_{t-1}(i)\} & \text{otherwise}
\end{cases}
\qquad (3)
$$


Notice that the only lines that ever enter level $i$ are the least priority lines $y_t(i-1)$ from lower levels.
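As a point of reference, recurrence (3) can be run serially, one reference at a time. The sketch below is an illustration under stated assumptions, not the parallel method of this paper: `stack_distances` is a hypothetical name, `None` stands for the empty line marker, priorities of distinct cached lines are assumed to be distinct, and the line passed down from each level is taken to be the lower-priority one, in keeping with the definition of the replacee as the least priority line.

```python
import math

EMPTY = None  # empty line marker; its priority is -infinity

def stack_distances(trace, prio, C):
    """Serial simulation of recurrence (3).
    trace[t-1] = x_t, prio[t-1] = P(x_t).
    Returns the stack distance of each reference (math.inf if it exceeds C)."""
    stack = [EMPTY] * C              # stack[i-1] holds s_t(i)
    cur = {EMPTY: -math.inf}         # current priority of each line
    dist = []
    for x, p in zip(trace, prio):
        d = stack.index(x) + 1 if x in stack else math.inf
        dist.append(d)
        cur[x] = p                   # R2: priority changes only on reference
        y = stack[0]                 # y_t(1) = s_{t-1}(1)
        stack[0] = x                 # s_t(1) = x_t
        for i in range(2, C + 1):
            if d < i:                # prior hit at level i: level unchanged
                break
            if stack[i - 1] == x:    # x_t was here: the replacee fills the hole
                stack[i - 1] = y
                break
            # otherwise keep the higher-priority line, pass the other one down
            if cur[y] > cur[stack[i - 1]]:
                stack[i - 1], y = y, stack[i - 1]
    return dist
```

On the trace a, a, b, c, b with a two-line set, LRU priorities make the last reference a hit at distance 2, while LFU priorities (with the tie-break above) keep the twice-referenced line a and turn the last reference into a miss.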


5.1 Parallel Simulation Level by Level 

In this section, we present a rapid parallel simulation algorithm for any reference-based replacement policy.
For any given $C > 0$, the objective is to compute the level $C$ stack distances $\Delta_1(C), \ldots, \Delta_N(C)$. The
algorithm works level by level, like the LRU simulation algorithm presented in Section 4. Specifically, we
compute the cache contents $s_t(i)$, the replacees $y_t(i)$ and the stack distances $\Delta_t(i)$ at level $i$, given these
same quantities at level $i-1$. As in the LRU simulation, just $O(1)$ space per reference is needed, with the
results of the level $i$ computation overwriting those for level $i-1$.

Level 1 is easy: $s_t(1) = x_t$, $y_t(1) = s_{t-1}(1)$, $\Delta_t(1) = 1$ if $s_t(1) = s_{t-1}(1)$, and $\Delta_t(1) = \infty$ otherwise. Two
facts, which follow from equation (3), are crucial to our approach for computing the desired results for level
$i > 1$:
• If a new line, say $\alpha$, enters level $i > 1$ at time $t$ (meaning $s_t(i) = \alpha$, $s_{t-1}(i) \neq \alpha$) then $x_t$ must be a
prior miss at level $i$ and $\alpha$ must be the replacee $y_t(i-1)$.

• Assuming $\alpha = y_t(i-1)$ does enter level $i > 1$, it propagates until coming to the first reference
$x_u$, $u > t$, where either $x_u = \alpha$ or $x_u$ is a prior miss and $P(y_t(i-1)) < P(y_u(i-1))$. That is,
$s_t(i) = \cdots = s_{u-1}(i) = \alpha$. If $u < N+1$ then replacee $y_u(i-1)$ enters level $i$ at time $u$. If there is no
such $u < N+1$ then the replacee $y_t(i-1)$ propagates on through time $N$. For every reference $x_t$ let
$u(t)$ denote the index so identified.

Consider the graph where the vertices are the indices $t$ of all prior misses $x_t$ and there is an edge from
$t$ to $u(t)$. This edge records the fact that if $y_t(i-1)$ enters level $i$ then $s_t(i) = \cdots = s_{u(t)-1}(i) = y_t(i-1)$,
whereupon it is replaced by $y_{u(t)}(i-1)$ (if $u(t) < N+1$). Observe that not all such $y_t(i-1)$ actually do enter
level $i$; for instance, in Figure 4, $y_{10}(2) = d$ does not enter level 3, because $P(y_{10}(2)) < P(s_9(3))$. However,
the set of references that do actually enter level $i$ can be determined by following the maximal path through
the graph, starting at vertex 1. Replacee $y_1(i) = \emptyset$ enters at time 1 and propagates until time $v = u(1) - 1$.
If $v < N+1$ then replacee $y_{v+1}(i-1)$ enters level $i$, and propagates until time $w = u(v) - 1$. If $w < N+1$
then replacee $y_{w+1}(i-1)$ enters level $i$, and so forth.

Converting this serial process for simulating level i into a parallel one, we simulate level i > 1 as follows: 

1. [Compute tentative propagation intervals.] For each prior miss compute

$$\mathrm{stop}(t) = \text{least } s > t \text{ such that } P(x_s) > P(x_t), \text{ and } x_s \text{ is a prior miss at level } i \qquad (4)$$

$$\mathrm{next}(t) = \text{least } s > t \text{ such that } x_s = x_t,$$

where, by convention, stop(t) and next(t) equal $N+1$ if the index $s$ above does not exist. Set
$u(t) = \min\{\mathrm{next}(t), \mathrm{stop}(t)\}$. By earlier remarks, if $y_t(i-1)$ enters level $i$ then
$s_t(i) = \cdots = s_{u(t)-1}(i) = y_t(i-1)$.

2. [Follow propagation chain.] The pointers $u(t)$ determine the replacees that enter level $i$, namely,
those with indices $0, u(0), u(u(0)), \ldots$, with the sequence stopping at $v = u(\cdots u(0) \cdots) \neq N+1$,
$u(v) = N+1$. Mark these replacees.

3. [Compute level i results.] For each marked replacee $y_t(i-1)$ and each $v$ in the interval $[t, u(t)-1]$, set
$s_v(i) = y_t(i-1)$. For every $x_t$ set $y_{t+1}(i)$ to $\max_P\{s_t(i), y_{t+1}(i-1)\}$. Following this, if $x_t$ is a prior
miss at level $i-1$ and if $s_{t-1}(i) = x_t$ set $\Delta_t(i) = i$. Set $\Delta_t(i) = \Delta_t(i-1)$ for prior hits.
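The three steps can be emulated serially to see how one level is derived from the previous one. The sketch below is only an illustration under assumptions that go beyond the text: `level_by_level` is a hypothetical name; a quadratic scan replaces the CLRN and sorting subroutines; the tentative intervals compare replacee priorities, following the second fact stated before the step list; the level $i$ replacee is taken to be the lower-priority of $s_{t-1}(i)$ and $y_t(i-1)$, in keeping with its definition as the least priority line; and priorities of distinct lines are assumed distinct.

```python
import math

EMPTY = None  # empty line marker; its priority is -infinity

def level_by_level(trace, prio, C):
    """Serial emulation of steps 1-3, repeated for levels 2..C.
    Returns the level C stack distances (math.inf for a miss)."""
    N = len(trace)
    x = [EMPTY] + list(trace)                   # x[t], t = 1..N
    P = [-math.inf] + list(prio)                # P[t] = P(x_t)
    # Level 1: y_t(1) = x_{t-1}; distance 1 on an immediate repeat.
    y = [EMPTY] + [x[t - 1] for t in range(1, N + 1)]
    py = [-math.inf] + [P[t - 1] for t in range(1, N + 1)]
    delta = [math.inf] + [1 if x[t] == x[t - 1] else math.inf
                          for t in range(1, N + 1)]
    for i in range(2, C + 1):
        miss = [t for t in range(1, N + 1) if delta[t] == math.inf]
        # Step 1: u(t) = first later prior miss that references the
        # replacee (next) or brings a higher-priority replacee (stop).
        u = {}
        for k, t in enumerate([0] + miss):
            u[t] = N + 1
            for s in miss[k:]:                  # prior misses after t
                if x[s] == y[t] or py[s] > py[t]:
                    u[t] = s
                    break
        # Steps 2 and 3: follow the chain from index 0 and fill level i.
        s_line = [EMPTY] * (N + 1)
        s_prio = [-math.inf] * (N + 1)
        t = 0
        while t <= N:
            for v in range(t, u[t]):
                s_line[v], s_prio[v] = y[t], py[t]
            t = u[t]
        # New distances and level i replacees (lower priority of the pair).
        new_y, new_py = [EMPTY], [-math.inf]
        for t in range(1, N + 1):
            if delta[t] == math.inf and s_line[t - 1] == x[t]:
                delta[t] = i
            if s_prio[t - 1] < py[t]:
                new_y.append(s_line[t - 1]); new_py.append(s_prio[t - 1])
            else:
                new_y.append(y[t]); new_py.append(py[t])
        y, py = new_y, new_py
    return delta[1:]
```

Only $O(1)$ values per reference are carried from one level to the next (`y`, `py`, `delta`), mirroring the space remark at the start of this section.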

Figure 5: Search tree identifying maximum values over subintervals, which is used to solve the nearest right
neighbor problem.

Computing the stop(t) values is an instance of the closest larger right neighbor (CLRN) problem, discussed
in Section 2.2. It can be solved in $O(\log N)$ time using $N$ PEs. A simpler $O(\log^2 N)$ time solution is described
below. Computing the next(t) values via sorting is described in Section 3; the time needed is $O(\log N)$ using
$N$ PEs. Marking the replacees on the chain of pointers from 0 to $N+1$ is a pointer jumping problem [14],
which can be solved in $O(\log N)$ time using $N/\log N$ PEs. The final step of updating the level $i$ cache
contents and stack distances is essentially the same as was done for LRU simulation in Section 4. We do not
repeat the details. The time needed is $O(\log N)$ using $N/\log N$ PEs. Summing up, the total time is
$O(\log N)$ using $N$ PEs. To produce the level $C$ results, the computation must be repeated for each $i = 2,
\ldots, C$. Thus,

Theorem 3 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated level $C$ stack distances $\Delta_1(C)$,
$\ldots, \Delta_N(C)$ induced by any reference-based replacement policy can be computed in time $O(C \log N)$ using $N$
PEs.
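The chain-marking step (step 2) can be pictured with plain pointer doubling. The sketch below is a simplified illustration, not the work-efficient $N/\log N$ processor method cited from [14]: `mark_chain` is a hypothetical name, each round's inner loop stands for one parallel step, and index $N+1$ is treated as a sentinel that points to itself.

```python
def mark_chain(u):
    """u[t] is the pointer of index t for t = 0..N, with t < u[t] <= N+1.
    Returns the indices on the chain 0, u(0), u(u(0)), ... before N+1."""
    n = len(u)                      # indices 0..N; the sentinel N+1 equals n
    jump = list(u) + [n]            # sentinel points to itself
    mark = [False] * (n + 1)
    mark[0] = True
    for _ in range(max(1, n.bit_length())):     # about log2(N) rounds
        new_mark = mark[:]
        for t in range(n + 1):                  # conceptually in parallel
            if mark[t]:
                new_mark[jump[t]] = True        # mark the node 2^k steps ahead
        jump = [jump[jump[t]] for t in range(n + 1)]   # double every pointer
        mark = new_mark
    return [t for t in range(n) if mark[t]]
```

After round $k$ every chain element within $2^k$ steps of index 0 is marked, which is why a logarithmic number of rounds suffices.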

We close this section with a simple method for solving the CLRN problem. This method plays the key 
role in generalizing the simulation method to accommodate priority aging (Section 5.3). 

Let $a_1, \ldots, a_N$ be a sequence of $N$ numbers, and, for simplicity, assume that $N$ is a power of two and
$a_N = +\infty$. For each $a_i$, $i = 1, \ldots, N-1$, we wish to find the closest right larger neighbor $a_j$; i.e., $j > i$ is
as small as possible and $a_i < a_j$. Construct a binary tree over the inputs as illustrated in Figure 5. Each
node is labeled with the maximum value of the inputs in its subtree and with the corresponding subrange
of indices. To find the closest larger right neighbor $a_j$ of $a_i$, a two phase search is initiated. Phase one starts
at the leaf node $a_i$, and progresses in steps up the tree. At each step we move from the present node to the
nearest internal node at the next higher level whose span includes a node to the right. This phase ends upon
visiting (i) an internal node which is rightmost at its level, or (ii) a node whose value is (strictly) greater
than $a_i$. In the example of Figure 5, the first phase for $a_3$ visits nodes representing ranges [3, 4] and [5, 8]. In
general, the first phase stops at a node spanning an interval $[m + 2^k, m + 2^{k+1}]$, with $i < m + 2^k$, which must
contain the sought $a_j$. In phase two the search descends down to $a_j$, at each step moving to the left child if
the left child’s value exceeds $a_i$, or to the right child otherwise. On a concurrent read model, carrying out
all $N$ searches in parallel gives an $O(\log N)$ time solution. On an exclusive read model, standard methods
[14] for resolving the read conflicts add an $O(\log N)$ factor, bringing the total time to $O(\log^2 N)$.
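The two phase search can be sketched on an array-backed max tree. This is an illustrative serial version in which each loop iteration over `i` stands for one processor's search; `clrn` is a hypothetical name, and the ascent rule `(v + 1) // 2` (parent of a left child, right neighbor of the parent for a right child) is one reading of "the nearest internal node at the next higher level whose span includes a node to the right". It reproduces the [3, 4] then [5, 8] sequence described for $a_3$.

```python
import math

def clrn(a):
    """Closest larger right neighbor via a max tree.
    a has length N (a power of two) with a[-1] == math.inf and finite
    values elsewhere. Returns the 1-based index j for i = 1..N-1."""
    N = len(a)
    tree = [0.0] * N + list(a)              # leaves live at N..2N-1
    for v in range(N - 1, 0, -1):           # label nodes with subtree maxima
        tree[v] = max(tree[2 * v], tree[2 * v + 1])
    result = []
    for i in range(1, N):
        ai = a[i - 1]
        v = N + i - 1                       # leaf holding a_i
        # Phase one: climb up and to the right until a larger value shows up.
        while True:
            v = (v + 1) // 2
            if tree[v] > ai:
                break
        # Phase two: descend to the leftmost leaf whose value exceeds a_i.
        while v < N:
            v = 2 * v if tree[2 * v] > ai else 2 * v + 1
        result.append(v - N + 1)
    return result
```

Because $a_N = +\infty$, every rightmost node carries the label $+\infty$, so phase one always stops before it could run off the right edge of a level.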

5.2 Random Replacement

The Random Replacement (RR) rule selects the line to replace on a cache miss independently and uniformly
at random from the set of cached lines. Mattson et al. [16] observed that the selections can be made so as
to preserve the stack property ($B_t(i-1) \subset B_t(i)$ for all $t$, $i > 1$), by coupling the random decisions as follows.
Suppose the members of $B_t(i-1)$ have been ranked randomly from 1 to $i-1$, with the understanding
that the higher a line’s rank the lower its priority. To build $B_t(i)$, insert $s_t(i)$ into the priority structure by
randomly choosing an integer rank for it from $[1, i]$, say $k$. Members of $B_t(i-1)$ with ranks $> k$ have their
ranks incremented by one in $B_t(i-1)$. Other members of $B_t(i-1)$ retain their rank. Thus, for each prior
miss $x_t$ at level $i$, we may decide the line $y_t(i)$ of least rank in $B_t(i)$ by a coin toss: with probability $1/i$,
line $y_t(i) = s_{t-1}(i)$, and with the complementary probability $y_t(i) = y_t(i-1)$. Moreover, the outcome is
completely independent of 

This independence can be exploited to considerably simplify the simulation of level $i$ over that described
above. Step 1 becomes: For each prior miss $x_t$, compute next(t) as before, and use a coin toss to decide
whether to label index $t$ as a “stopper” (probability $1/i$) or leave the index unlabeled (probability $1 - 1/i$).
Let stop(t) be the least stopper $u > t$, or $N+1$ if no such $u$ exists. Let $u(t) = \min\{\mathrm{stop}(t), \mathrm{next}(t)\}$ as
before. Steps 2 and 3 remain the same.
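The modified step 1 can be sketched as follows. The function name `rr_stop` and its arguments are hypothetical, and the right-to-left loop is a serial stand-in for the scan a parallel implementation would use.

```python
import random

def rr_stop(prior_miss, i, rng):
    """prior_miss[t-1] is True when x_t is a prior miss at level i.
    Labels each prior miss a stopper with probability 1/i and returns
    (stopper, stop), where stop[t-1] is the least stopper u > t, or N+1."""
    N = len(prior_miss)
    stopper = [m and rng.random() < 1.0 / i for m in prior_miss]
    stop = [N + 1] * N
    nearest = N + 1                  # closest stopper seen so far on the right
    for t in range(N, 0, -1):
        stop[t - 1] = nearest
        if stopper[t - 1]:
            nearest = t
    return stopper, stop
```

Passing an explicitly seeded generator, for example `random.Random(0)`, keeps a simulation run repeatable.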

Computing the stop(t) values entails $N$ independent coin tosses and a segmented copy-scan (cf. Section
2.2), operations that net $O(\log N)$ time using $N$ PEs. As a result, we obtain

Theorem 4 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated level $C$ stack distances $\Delta_1(C)$,
$\ldots, \Delta_N(C)$ induced by the RR policy can be computed in time $O(C \log N)$ using $N$ PEs.

Comparing with Theorem 3 we see that dropping the CLRN problem and the probabilistic algorithm 
used to solve it strengthens the running time bound from one that holds with high probability to one that 
holds deterministically. 

5.3 Priority Aging 

Under any reference-based replacement rule other than RR, a line’s priority is fixed when it enters the 
cache. If the policy is a practical one, then it is likely that the priority is a simple function of the previous 
references. For example, under LFU a line’s priority is the number of earlier references to the line. Past 
cache activity gives an imperfect indication of future cache activity. Under LFU a flurry of references to 
a small set of lines might lead to their long retention during a subsequent period when the lines are not 
needed. To counter this, it is natural to consider policies that allow a line’s priority to age; i.e., to decrease 
monotonically while the line remains unreferenced. In this Section, we extend our reference-based simulation 
method to accommodate aging. 

Let $\phi : \mathbb{R} \to \mathbb{R}$ be a monotonically decreasing operator, and let $\phi^d$ represent the $d$-fold application of
$\phi$. Let $P_t(\alpha)$ denote the priority of a line $\alpha$ held in the cache at the time of $x_t$. We consider replacement
policies where the initial priority of line $x_t$ is reference-based ($P_t(x_t)$ satisfies R1 of Section 5), but the line's
priority “ages” to $P_{t+d}(x_t) = \phi^d(P_t(x_t))$ in the cache $B_{t+d}(C)$ if it remains unreferenced throughout time
$t+1, \ldots, t+d$. As before, the replacement policy always selects the line with least priority. Some natural
aging operators $\phi$ are $\phi(x) = x - a$ for some fixed $a > 0$ or $\phi(x) = ax$ for some fixed $a \in (0, 1)$. We assume
that for any $d = 1, \ldots, N-1$, $\phi^d(x)$ can be computed in $O(1)$ time.

Equation (3) describing the evolution of the stack levels continues to hold. A little thought shows that
to adapt the simulation method to accommodate aging, we need only change the definition of stop(t) in
equation (4) to

$$\mathrm{stop}(t) = \text{least } s > t \text{ such that } P_s(x_s) > \phi^{s-t}(P_t(x_t)), \text{ and } x_s \text{ is a prior miss at level } i.$$
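To make the aged definition concrete, the sketch below pairs closed forms for the two aging operators mentioned above with a brute-force evaluation of the aged stop(t). All names are hypothetical, and the quadratic scan is only a reference computation against which a faster method could be checked.

```python
def subtractive(a):
    # phi(x) = x - a, so the d-fold application is x - d*a.
    return lambda x, d: x - d * a

def multiplicative(a):
    # phi(x) = a*x with 0 < a < 1, so the d-fold application is a**d * x.
    return lambda x, d: (a ** d) * x

def aged_stop(P, prior_miss, phi_d):
    """P[t-1] = P_t(x_t); prior_miss[t-1] flags prior misses at the level.
    phi_d(x, d) applies the aging operator d times.
    Returns stop[t-1] = least s > t with P_s(x_s) > phi^(s-t)(P_t(x_t))
    among prior misses, or N+1 if there is none."""
    N = len(P)
    stop = [N + 1] * N
    for t in range(1, N + 1):
        for s in range(t + 1, N + 1):
            if prior_miss[s - 1] and P[s - 1] > phi_d(P[t - 1], s - t):
                stop[t - 1] = s
                break
    return stop
```

With priorities 8, 3, 1, 5 and $\phi(x) = x/2$, the first line is stopped at index 4 because its priority has aged to 1 by then; without aging nothing to its right would ever exceed it.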

Computing these new values stop(t) can be posed as the following variant of the CLRN problem. Let $a_1, \ldots,
a_N$ be a sequence of $N$ numbers, and, for simplicity, assume that $N$ is a power of two and $a_N = +\infty$. For
each $i = 1, \ldots, N-1$, we wish to find the smallest $j > i$ such that

$$\phi^{j-i}(a_i) < a_j.$$

We now sketch how to extend the binary search solution given in Section 5 to solve the new problem.
Let $\phi^{-1}$ denote the inverse of $\phi$. Since $\phi^{-1}$ is monotone increasing, the inequality above is equivalent to

$$\phi^{u}(a_i) < \phi^{-v}(a_j) \text{ for all nonnegative integers } u, v \text{ with } u + v = j - i.$$

Letting $[u, v]$ be any range of indices with $u > i$,

$$\phi^{j-i}(a_i) < a_j \text{ for some } j \in [u, v] \iff \phi^{u-i}(a_i) < \max\{a_u, \phi^{-1}(a_{u+1}), \ldots, \phi^{-(v-u)}(a_v)\}.$$

This equivalence reveals a way to determine whether $a_i$ ages below some $a_j$, $j \in [u, v]$ and $i < u$; i.e.,
whether $\phi^{j-i}(a_i) < a_j$. For $j > u$ define $\phi^{-(j-u)}(a_j)$ to be the “rejuvenated” value of $a_j$ with respect to index
$u$ (i.e., $a_j$’s priority if “de-aged” back to position $u$). Let $r_{u,v} = \max\{a_u, \phi^{-1}(a_{u+1}), \ldots, \phi^{-(v-u)}(a_v)\}$ denote the
maximum (over $j \in [u, v]$) rejuvenated value of any $a_j$ with respect to $u$. To find out if $a_i$ ages below some
$a_j$ with $j \in [u, v]$, we may simply compare $\phi^{u-i}(a_i)$ and $r_{u,v}$. Since $a_i$ is arbitrary in this discussion, it is
possible to use $r_{u,v}$ concurrently in many searches.

A “rejuvenation-max” tree can be built in parallel using the following observation: for any $M$ (assumed
to be a power of 2)

$$r_{1,M} = \max\Big\{ \max_{i \le M/2} \{\phi^{-(i-1)}(a_i)\},\; \phi^{-M/2}\Big( \max_{i \le M/2} \{\phi^{-(i-1)}(a_{M/2+i})\} \Big) \Big\}$$

This recursion shows we can build a rejuvenation-max tree over $a_1, \ldots, a_N$ in $\log N$ steps. At every step,
all nodes at a given level of the tree are constructed. Leaves are understood to be at level $\log N$, the root
is at level 0. A node spanning an interval $[u, v]$ is labeled with $r_{u,v}$. The recursion shows that to compute
the label of a node at level $k$, one computes the maximum of (i) the label on the node’s left-child, and (ii)
the label on the node’s right-child promoted by an operator $\phi^{-c}$, with $c = 2^{\log N - k}$. Thus, the parent of a left
node with value $a$ and right node with value $b$ has label $\max\{a, \phi^{-c}(b)\}$. Figure 6 illustrates how the tree of
Figure 5 is modified to accommodate the aging operator $\phi(x) = x/2$.

Figure 6: Rejuvenation-max tree, which is used to solve the nearest right neighbor problem when aging is
permitted. In this example, the aging operator $\phi(x) = x/2$.

Given $a_i$ we wish to find the closest $a_j$ to the right such that $\phi^{j-i}(a_i) < a_j$. The same two phase strategy
described in Section 5 works, replacing the comparison of the value $a_i$ with the label of a node spanning an
interval $[u, v]$ with the comparison of $\phi^{u-i}(a_i)$ with $r_{u,v}$, or an equivalent comparison. For example, consider
$a_3$’s search for the setup of Figure 5. In the first phase, the search moves up and to the right in the tree,
looking over successively larger intervals for a value that $a_3$ ages below. First, we compare $a_3 = 6$ with
$r_{3,4} = 6$, and since $a_3$ is not smaller continue the first phase. The next node visited represents [5, 8]. We
compare $\phi^2(a_3) = 1.5$ with $r_{5,8} = 32$, and as $a_3$ is smaller, phase one stops at this node. In the second phase,
the search moves down from [5, 8] to locate the leftmost $a_j$ in [5, 8] that $a_3$ ages below. First, we branch
left to [5, 6], because $r_{5,6} = 14$ is larger than $\phi^2(a_3) = 1.5$. Second, we branch left again to [5, 5] because
$r_{5,5} = 3 > \phi^2(a_3) = 1.5$. Since [5, 5] is a leaf, the search stops, having located the right match, $a_5$, for $a_3$.
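The rejuvenation-max tree and its two phase search can be sketched in the same array layout as an ordinary max tree. This is an illustrative serial version with hypothetical names; `phi_d(x, d)` is assumed to apply the aging operator $d$ times and to accept negative $d$ for rejuvenation, and each right child is promoted by the width of its left sibling's span, which is the $M/2$ shift in the recursion for $r_{1,M}$.

```python
import math

def build_rejuvenation_tree(a, phi_d):
    """Node with span [u, v] gets r_{u,v}; leaves live at N..2N-1."""
    N = len(a)
    tree = [0.0] * N + list(a)
    size, lo = 1, N // 2             # child span width, first node of the level
    while lo >= 1:
        for v in range(lo, 2 * lo):
            # left label, and right label de-aged back by the left span width
            tree[v] = max(tree[2 * v], phi_d(tree[2 * v + 1], -size))
        size, lo = 2 * size, lo // 2
    return tree

def span_start(v, N):
    # 1-based index of the leftmost leaf under node v.
    depth = v.bit_length() - 1
    return (v - (1 << depth)) * (N >> depth) + 1

def aged_clrn(a, phi_d):
    """For i = 1..N-1, the least j > i with phi^(j-i)(a_i) < a_j.
    Assumes N is a power of two and a[-1] == math.inf."""
    N = len(a)
    tree = build_rejuvenation_tree(a, phi_d)
    out = []
    for i in range(1, N):
        ai = a[i - 1]
        v = N + i - 1
        while True:                  # phase one: climb up and to the right
            v = (v + 1) // 2
            if tree[v] > phi_d(ai, span_start(v, N) - i):
                break
        while v < N:                 # phase two: descend to the leftmost match
            left = 2 * v
            if tree[left] > phi_d(ai, span_start(left, N) - i):
                v = left
            else:
                v = left + 1
        out.append(v - N + 1)
    return out
```

With $\phi(x) = x/2$ and the made-up input 8, 3, 6, 2, 3, 7, 1, $+\infty$, the labels of the nodes spanning [3, 4], [5, 6] and [5, 5] are 6, 14 and 3, and the search for $a_3 = 6$ ends at $a_5$, matching the values quoted in the walk-through above.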

Building the rejuvenation-max tree costs $O(\log N)$ time using $N$ PEs. On the CREW model, we may
assign one processor to the search for each input, and so obtain an $O(\log N)$ time solution using $N$ PEs.
This in turn implies an $O(\log^2 N)$ time solution using $N$ PEs on the EREW model. The final result is:

Theorem 5 On the EREW model, given the trace $x_1, \ldots, x_N$, the level $C$ stack distances $\Delta_1(C), \ldots,
\Delta_N(C)$ induced by any reference-based policy with aging can be computed in time $O(C \log^2 N)$ using $N$ PEs.

6 Summary 

Trace driven cache simulation is an important tool used in the design of computer systems. Parallel processing
offers the promise of reducing the time required to execute a cache simulation, and hence of reducing the
overall cache design time. We have shown how massively parallel SIMD architectures can be applied to this
important problem area.

We note that the bottleneck problem in our simulation of reference-based replacement policies is the
closest larger right neighbor problem. An improvement in the solution of this problem to $O(\log N)$ time
(without using probabilistic methods) and $N$ PEs on the EREW model appears possible [7]. This would
improve the reference-based simulation method to deterministic $O(C \log N)$ time using $N$ PEs.

A number of important issues remain. There is a class of “clock-based” stack algorithms which do not 
appear to fit within our framework. The classical clock algorithm [6] associates one bit with each physical 
line in the cache. The bit is set whenever a new line is written into the physical location. A clock counter 
determines replacement lines. On a hit the counter is untouched, but on a miss, the counter scans the set 
for a clear bit; the first clear bit found identifies the replacement line. The scan begins where the counter 
was last left, and any set bit encountered in the scan is cleared. Thus, at most one scan of the set is needed 
to find a clear bit. The clock algorithm is induced if we assign priority d to line r if d lines must be scanned 
by the clock before choosing r as the replacement. A line’s priority ages as it sits in the cache, but in a 
highly state-dependent way. One object of our future research is to determine whether clock-based stack 
algorithms can be simulated in parallel. 

We believe that the geometric methods used to obtain the fast, set size independent LRU simulation
method of Section 3 might yield similar simulation methods for OPT and for general reference-based policies.
Another important issue is whether these techniques can be extended to the simulation of multiprocessor 
caches. Yet another issue is the use of SIMD processors to generate synthetic cache traces. The method 
discussed in [20] is basically LRU “in reverse”: given the stack distances, compute the reference string. We 
believe we can implement this method in poly-log time using ideas similar to those developed here. Given 
the promise of SIMD trace-driven simulation, a more comprehensive study of parallelized synthetic trace 
generation will be useful. 